How to write SPARQL queries against Freebase data
:BaseKB unleashed
Overview
Freebase RDF data is clean and well-organized, so it can be straightforward to write queries if you understand how. Although a "cookbook" on the subject doesn't yet exist, this post describes the minimum you need to know to write SPARQL queries against Freebase data.
What to load
Although it is possible to load Freebase data directly into a triple store, it is a difficult process because the Freebase RDF dump is not entirely compatible with RDF standards -- many tools will crash or otherwise fail to load the data. The Freebase RDF dump also contains hundreds of millions of redundant or uninteresting triples that greatly increase both loading and query times.
We use the Open Source Infovore framework to produce :BaseKB, a purified data product which is compatible with RDF standard tools.
We've heard reports of people loading :BaseKB Gold into a number of triple stores, including AllegroGraph, BigData, and OpenLink Virtuoso. This product is a free download via BitTorrent. The ideal hardware for loading this data is a quad-core machine with at least 32GB of RAM and SSD storage.
If you'd like to skip the loading step, which can take hours, you can use the RDFeasy Compact Edition in the AWS Marketplace, which combines OpenLink Virtuoso, :BaseKB data and perfectly matched hardware for a low hourly price. This is an excellent option for evaluation, research, and development, because (1) you can get started in ten minutes and (2) you only need to pay for the time when you're using it.
Prefix declaration
:BaseKB rewrites URIs from the http://rdf.freebase.com/ns/ namespace to http://rdf.basekb.com/ns/. Since nearly all of the entities, types, and predicates you'll use come in this namespace, we write
prefix : <http://rdf.basekb.com/ns/>
at the beginning of all queries. If you're using raw data from Freebase, you can write
prefix : <http://rdf.freebase.com/ns/>
and get similar results.
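The namespace rewrite described above can be sketched in a few lines of Python. This is only an illustration under the assumption that the rewrite is a plain prefix swap; the helper names `to_basekb` and `prefix_line` are made up for this example and are not part of :BaseKB or Infovore.

```python
# Namespaces as given in the text above.
FREEBASE_NS = "http://rdf.freebase.com/ns/"
BASEKB_NS = "http://rdf.basekb.com/ns/"


def to_basekb(uri: str) -> str:
    """Rewrite a Freebase URI into the :BaseKB namespace (hypothetical helper).

    URIs outside the Freebase namespace are returned unchanged.
    """
    if uri.startswith(FREEBASE_NS):
        return BASEKB_NS + uri[len(FREEBASE_NS):]
    return uri


def prefix_line(use_basekb: bool = True) -> str:
    """Return the prefix declaration to put at the top of a query."""
    namespace = BASEKB_NS if use_basekb else FREEBASE_NS
    return f"prefix : <{namespace}>"


if __name__ == "__main__":
    print(to_basekb("http://rdf.freebase.com/ns/m.06bnz"))
    print(prefix_line())
```

Keeping the namespace in one place makes it easy to point the same query at either :BaseKB or raw Freebase data.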
Looking up entities and predicates
Let's try a query I was asked about, which is to find the longest river entirely contained in Russia.
In a perfect world we'd have a :BaseKB-powered schema browser, but for now we can use the Freebase web interface. Go to the Freebase home page
and type the word Russia into the autosuggest at the top. You'll see a list of suggestions.
If you click on the first link, you'll get to the country page for Russia, which is
https://www.freebase.com/m/06bnz
and if you look at the head of the page you will see a mid identifier
You can read the mid /m/06bnz either from the header at the top of the page or from the URL of the page. Either way, to use it as an RDF identifier you replace the first slash with a colon and the second slash with a period to get
:m.06bnz
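If you need to do this conversion often, the rule above is easy to script. The following is a minimal sketch; the function name `mid_to_rdf` is invented for this example, and it assumes the input is either a bare mid or a Freebase page URL whose path is the mid.

```python
from urllib.parse import urlparse


def mid_to_rdf(mid_or_url: str) -> str:
    """Turn '/m/06bnz' (or a page URL ending in it) into ':m.06bnz'.

    Hypothetical helper: the first slash becomes a colon and the
    second slash becomes a period, as described in the text.
    """
    # For a bare mid the "path" is the mid itself; for a URL it is the part after the host.
    path = urlparse(mid_or_url).path
    parts = path.strip("/").split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"not a mid: {mid_or_url!r}")
    return ":" + ".".join(parts)


if __name__ == "__main__":
    print(mid_to_rdf("/m/06bnz"))
    print(mid_to_rdf("https://www.freebase.com/m/06bnz"))
```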
Now we also need to find two properties to write this query:
- We need a property that states that a location is completely contained in another location, and
- We need a property to find the length of a river.
We can start at the Freebase home page, which lists "bases" that contain common types and properties.
We click on location, then again on "Location", and see a list of properties.
The one we want is Contained-by, which we can rewrite as
:location.location.containedby
Note that most properties in Freebase have this structure. The first part is the name of a 'base', which organizes groups of related types. The second part is the name of a type, which can be referred to as
:location.location
and the last part is the name of the property, containedby. The exception to this rule is that some bases belong to users or are contained in other bases, in which case the name of the base could have multiple parts.
The river length property takes a little more digging, because rivers are not under location; they are under geography, which shows up as "Physical Geography" on the top page. You'd first look at
https://www.freebase.com/geography
and then at
https://www.freebase.com/geography/river?schema=
and eventually find the property you want is
:geography.river.length
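The base / type / property structure can be made explicit with a short Python sketch. It assumes the Freebase-style id for this property is /geography/river/length (the slash form of the name above); the names `path_to_rdf`, `split_property`, and `PropertyName` are hypothetical, and the multi-part base in the usage example is invented purely to illustrate the exception mentioned earlier.

```python
from typing import NamedTuple


class PropertyName(NamedTuple):
    base: str        # may itself contain periods for user or nested bases
    type_name: str
    prop: str


def path_to_rdf(path: str) -> str:
    """Turn a slash-separated id like '/geography/river/length' into ':geography.river.length'."""
    return ":" + ".".join(path.strip("/").split("/"))


def split_property(rdf_name: str) -> PropertyName:
    """Split a prefixed property name into base, type, and property.

    The last part is the property, the part before it is the type,
    and everything earlier is treated as the base.
    """
    parts = rdf_name.lstrip(":").split(".")
    if len(parts) < 3:
        raise ValueError(f"expected base.type.property, got {rdf_name!r}")
    return PropertyName(".".join(parts[:-2]), parts[-2], parts[-1])


if __name__ == "__main__":
    print(path_to_rdf("/geography/river/length"))
    print(split_property(":geography.river.length"))
    # Made-up name with a multi-part base, for illustration only.
    print(split_property(":user.someone.mybase.widget.size"))
```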
Putting it all together
Now that we know the properties we need, we can write the following query
prefix : <http://rdf.basekb.com/ns/>
select ?river ?length {
?river :geography.river.length ?length .
?river :location.location.containedby :m.06bnz .
} ORDER BY DESC(?length) LIMIT 1
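To ask the same question about another country, you can generate the query text from a template. This is a rough sketch rather than a recommended client: `longest_river_query` is a name invented here, and the check on the country argument assumes mids consist of lowercase letters, digits, and underscores.

```python
import re

PREFIX = "prefix : <http://rdf.basekb.com/ns/>"


def longest_river_query(country: str, limit: int = 1) -> str:
    """Build the query from the text for a country given as a prefixed mid.

    Hypothetical helper; the country must look like ':m.06bnz' so that
    arbitrary text cannot be spliced into the query.
    """
    if not re.fullmatch(r":m\.[0-9a-z_]+", country):
        raise ValueError(f"expected a prefixed mid like ':m.06bnz', got {country!r}")
    return "\n".join([
        PREFIX,
        "select ?river ?length {",
        "  ?river :geography.river.length ?length .",
        f"  ?river :location.location.containedby {country} .",
        f"}} ORDER BY DESC(?length) LIMIT {int(limit)}",
    ])


if __name__ == "__main__":
    print(longest_river_query(":m.06bnz"))
```

Raising the limit returns the top few rivers instead of only the longest.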
If you're using RDFeasy, you can run this in the "Database/Interactive SQL" tab by putting the command 'sparql' in front of the SPARQL.
The result is a single row: the river's identifier and its length.
If we convert that mid back to a Freebase detail page we get
https://www.freebase.com/m/0203mm
which is the right answer.
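Going back from an RDF identifier to a Freebase page is the reverse of the earlier rule. The sketch below assumes the query result is either a full URI in one of the two namespaces or a prefixed name; `rdf_to_freebase_url` is a hypothetical helper name.

```python
NAMESPACES = (
    "http://rdf.basekb.com/ns/",
    "http://rdf.freebase.com/ns/",
)


def rdf_to_freebase_url(name: str) -> str:
    """Map ':m.0203mm' or a full namespace URI to the Freebase detail page URL."""
    for namespace in NAMESPACES:
        if name.startswith(namespace):
            local = name[len(namespace):]
            break
    else:
        # Not a full URI: treat it as a prefixed name such as ':m.0203mm'.
        local = name.lstrip(":")
    # Periods in the local name correspond to slashes in the Freebase id.
    return "https://www.freebase.com/" + local.replace(".", "/")


if __name__ == "__main__":
    print(rdf_to_freebase_url(":m.0203mm"))
    print(rdf_to_freebase_url("http://rdf.basekb.com/ns/m.0203mm"))
```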
Thinking in RDF
Note that we don't need to put
?river a :geography.river .
into the query because only a :geography.river can be the subject of :geography.river.length. This isn't just because Freebase types are organized like base -> type -> property, but because RDFS can infer the type statement above based on
:geography.river.length rdfs:domain :geography.river .
Just as computer programs (particularly in Java) can grow in verbosity, SPARQL queries can too, so it's wise to leave out any constraints that are unnecessary.
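To see why the type constraint is redundant, here is a toy Python sketch of the rdfs:domain rule, using the river length property from the query above. It is not how a triple store implements RDFS; triples are plain tuples of strings, `infer_types` is a name invented for this example, and the river identifier and length value in the usage are placeholders, not real data.

```python
RDF_TYPE = "rdf:type"
RDFS_DOMAIN = "rdfs:domain"


def infer_types(triples):
    """Apply the rdfs:domain rule once.

    If P has domain C and (S, P, O) is in the data, then S has type C.
    Returns the set of inferred (subject, rdf:type, class) triples.
    """
    domains = {}
    for subject, predicate, obj in triples:
        if predicate == RDFS_DOMAIN:
            # A property may declare more than one domain.
            domains.setdefault(subject, set()).add(obj)

    inferred = set()
    for subject, predicate, obj in triples:
        for cls in domains.get(predicate, ()):
            inferred.add((subject, RDF_TYPE, cls))
    return inferred


if __name__ == "__main__":
    data = [
        (":geography.river.length", RDFS_DOMAIN, ":geography.river"),
        # Placeholder river and length, for illustration only.
        (":m.example_river", ":geography.river.length", "1.0"),
    ]
    print(infer_types(data))
```

Anything that has a :geography.river.length is already known to be a :geography.river, so stating it again in the query adds nothing.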
Further reading
It makes sense to read the SPARQL 1.1 specification cover-to-cover, as well as the Metaweb architecture documentation.
This post is the first of a series: future posts will cover compound value types, introspection of the Freebase schema, how to look up identifiers, and other topics. Subscribe to our RSS feed and the :BaseKB mailing list.